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' Although event-driven algorithms have been shown to be far more efficient than time-driven 

, methods such as conventional molecular dynamics, they have not become as popular. The main 

obstacle seems to be the difficulty of parallelizing event-driven molecular dynamics. Several basic 
ideas have been discussed in recent years, but to our knowledge no complete implementation has 
^Ip^ , been published yet. In this paper we present a parallel event-driven algorithm including dynamic 

load-balancing, which can be easily implemented on any computer architecture. To simplify matters 
our explanations refer to a basic multi-particle system of hard spheres, but can be extended easily 
to a wide variety of possible models. 
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I. INTRODUCTION 



Event-driven molecular dynamics is an effective algorithm for the simulation of many-component systems, which 
evolve independently, except for discrete asynchronous instantaneous interactions. As an example we will discuss a 
system consisting of N hard spheres in a box with periodic boundary conditions, but the algorithm can be extended to 
particles with different shapes and interacting via any piecewise constant potential or to completely different problems 
as well [ij. 

^ Event-driven molecular dynamics processes a series of events asynchronously one after another. A straight-forward 

but simplistic approach ^] updates all particles at each event; many advanced versions are available in text-books 
(— I . 0, . We base our algorithm on the sophisticated algorithm as presented by Q , which updates only those particles 
O i' involved in an event. It has been succesfuUy applied to many different problems, among them granular gases [^Qj 
polymer chains |^, tethered membranes ;2,i9j, protein folding 10], and battlefield models 11]. 

In many physical systems the duration of the interaction of the components, e. g. the collision of two particles, 
^ ] is negligible compared to the time between these interactions. The simulation of such systems with traditional 
• time-driven molecular dynamics is highly inefficient. Instead, it is straight-forward to consider the interactions as 
' instanteneous events and solve the problem with an event-driven algorithm. 

I One reason why event-driven molecular dynamics has not become as popular as conventional time-driven molecular 
dynamics is the fact that parallelization is extremely complicated. The paradoxical task is to algorithmically parallelize 
physically non-parallel dynamics. Nevertheless some ideas and basic considerations about the parallelization have 
been proposed in [l^ . They ar e esp ecially suited for shared memory computers, but can be transferred to distributed 
memory architectures as well Apart from those ideas no full and general implementation of a parallelized 

algorithm including load-balancing has been published yet, to our knowledge. In this paper we present an algorithm, 
which is based on those ideas, but is enhanced and completed at several points. It can be implemented with generic 
tools such as MPI, and therefore it is suited for shared and distributed memory architectures alike. 
'' In section ^] we explain the details of the implementation of event-driven molecular dynamics. The parallelization 
O |. of the algorithm is presented in section UTTI and a summary is given in section Hvl 



II. EVENT-DRIVEN MOLECULAR DYNAMICS 



Section FlI Al gives an outline of the main routine of the algorithm, which is composed of 4 steps (see Fig.^. Then 
the most important data structures are introduced in section FlI Bl and hereby step I and 4 are treated. Step 2 and 3 
are discussed in sections III CI and III Dl respectively. Finally, section Hi El deals with performance issues. 



A. Outline 



Event-driven molecular dynamics processes a series of discrete events. In a system of hard spheres events typically 
refer to instantaneous collisions, involving exactly two particles. Only these two particles are processed; the state of 
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event processing loop: 

1. get next event from priority queue current time 

2. update states of both particles to current time 

3. calculate new future events for both particles 

4. schedule next event for both particles in priority queue 

5. goto 1 



FIG. 1: Outline of the main routine of the algorithm 



particle states 


: A^x {to, r(to), v(to), • • • , counter, cell, id} 


event list 


: A'^x {tev, type, partner, counter(partner)} 



FIG. 2: Data structures for A'^ particles 



the other particles remains unchanged. So the state information of most particles refers not to the current time, but 
to a different point of time in the past. 

Event processing includes a state update for the concerned particles (see section III C|l and the calculation of the 
next future events for those particles (see section Hi D|) . An outhne of the algorithm can be seen in Fig.Q] 

The serial algorithm only adds to this event processing loop an initialization step at the beginning and periodic 
data output. The parallel algorithm needs several additional routines, which are described in section ITTll 

B. Data Organization 

The algorithm maintains two data fields, particle states and event list, which contain exactly one entry per particle 
(see Fig.EJ. The former refers to the past of a particle, the latter to its future {-^ Vi : state time io(*) < current time < 
event time tcv(*))- 

A particle state consists of the physical state of a particle, such as position, velocity, etc. immediately after the most 
recently processed event of that particle, the point of time of that event, an event counter (see section III CL a cell 
number (see section III Dll . and a global particle number, which is needed to identify particles on different processes 
(see section InH) . 

The event list associates with every particle an event in the future. The data units of the event list consist of event 
time, event type, event partner, and a copy of the event counter of the partner. 

All the events are scheduled in a priority queue which is usually implemented as an implicit heap tree 0, ll^ ll^ , 
but other data structures with similar characteristics are possible, too 0, 0| . A heap tree is a binary tree with 
the following ordering rule: Each parental node has higher priority than its children; the children themselves are not 
ordered. So the root node always has highest priority, i. e. the earliest event time. To get the next event from the 
priority queue takes computational costs of O (1). Insertion of a new event in the priority queue is done with costs of 
OilogN). _ ^ . . 

This data organization is more efficient compared to the algorithm in Il2j , since no double buffering is needed 
here, however, the basic ideas of 0,01 are still valid. 

C. State Update 

When a collision between two particles is processed, the states of these particles are updated from a point of time 
in the past to the current simulation time. First, the positions and velocities of the particles immediately before the 
collision can be derived by inserting the old state in the equation of motion. As a result of this first step the particles 
are touching each other. Then the interaction between the particles takes place and yields new particle velocities. 
Now, the event counter is increased, the state time is updated to the current simulation time, and the values of the 
positions and velocities immediately after the collision are stored in the state array. 



3 



A typical collision rule for hard spheres 0, 0| looks like 

1 + r / \ 

v'l/2 = Vi/2 T (^k • (Vi - V2)j k , 

where primes indicate the velocities v after the collision, k is a unit vector pointing along the line of centers from 
particle 1 to particle 2, and r denotes the restitution coefhcient. For the performance tests in the subsequent chapters 
we have used the simplest case without dissipation (r — 1). |23| 

If the event partner has undergone collisions with other particles between the scheduling and the processing of the 
event, it becomes invalid. Validity of event partners can be checked by comparing the event counter of the partner 
to the copy in the event list. If they do not match, the partner has collided with other particles in the meantime, 
and the scheduled collision is no longer valid. Then event processing only refers to one particle and the state update 
described above only consists of the particle motion; no collision is performed. After that the algorithm continues in 
the normal way, i. e. the next event for the particle will be calculated and scheduled in the priority queue. 

Note that the algorithm in 0,^2 '^^^^ ^ different strategy to detect invalid event partners: The algorithm checks at 
each collision if an event partner becomes invalid and if so marks this partner. This strategy is less efficient, because 
sometimes the same partner is invalidated several times. Besides, in the parallel algorithm additional communication 
between different processes might be necessary. 

D. Event Calculation and Linked Cell Structure 

When an event has been processed, new events for the particles involved in that event have to be calculated. In 
simulations of hard spheres this means the calculation of possible future collisions. If the particles move on ballistic 
trajectories 

r,(t) r,(to) + v,(to) it - to) + ^g{t- tof , 
two particles 1 and 2 will collide at time ti2: 

ti2 ^to+ (^-ri2 ■ vi2 - ^J (ri2 ■ vi2)2 - {rj^ - (R^ + R2Y) /vl^ , 

where V12 — V2{to) — Vi(io) and ri2 — r2(to) — ^i{to) are the relative velocities and positions of the particles at time 
to, and Ri are the radii of the particles. If ti2 is imaginary or smaller than to, the particles will not collide. 

If the algorithm would have to check for collisions with all other particles, performance would be very poor for large 
numbers N of particles. So we divide simulation space in C cells with equal sides. Each particle belongs to the cell 
in which its center lies. If the cell size is larger than the maximal interaction length of the particles, i.e. the particle 
diameter in the case of hard spheres, event calculation has to check for possible collisions of two particles only if they 
belong to the same cell or if their cells are neighbors. (One square cell has 8 neighbors in 2D and a cube has 26 
neighbors in 3D.) 

However, the algorithm has to keep track of cell changes. So additional events, namely cell changes, come into play. 
They are treated just in the same way as the collision events; only the collision partner is the boundary of a cell. The 
difference with a collision event is that only one particle is involved in a cell change, and the velocity of the particle 
does not change at event time. Cell changes at the boundary of the simulation space require that the position vector 
jumps to the opposite side of the simulation volume due to the periodic boundary conditions. 

E. Optimal Cell Numbers 

On average there are N/C — 1 particles in the neighborhood of each particle. So in the limit C <^ N this number 
becomes very large and many possible events have to be calculated after each collision. On the other hand, if C ;2> iV, 
then each particle has to cross many cell boundaries between a collision of two particles, and thus more events have to 
be treated to complete the simulation. These two contributions compete with each other, the first one is proportional 
to {C/N)~^ and the second one is proportional to (C/A^)^/^. Fig. O (left) shows a broad minimum of the simulation 
time for low densities at the optimal cell number C/N « 1.5 in 2D and C/N « 8 in 3D. So the program can choose an 
optimal cell number before the calculation starts. Note that in the high density limit, especially in 3D, the optimal 
cell number cannot be reached, because the size of the cells has to be larger than the particle diameter (see Fig. O 
(right)). So, in this case C should simply be chosen as big as possible. 
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C/N C/N 

FIG. 3: Simulation time tcpv (arbitrary units) plotted against the number C of cells in 2D and 3D with the serial algorithm. 
Used are A'' = 10000 particles (2D) and A*' — 8000 particles (3D). The left figure shows low densities (volume fraction v — 0.008 
(2D) and = 0.0005 (3D), the right figure shows high densities {u = 0.8 (2D) and u = 0.5 (3D)). 

III. THE PARALLEL ALGORITHM 

In section Fill Al we demonstrate how parallelization can be achieved via domain decomposition and dynamic load- 
balancing. Then, in section lTlI BI we point out the difficulties that arise in parallelizing this algorithm and the necessity 
of state saving and error recovery (see section UlI C|l . In section UlI Dl the parallel algorithm is explained in detail. 
Finally, some performance issues are discussed in section Fill El 

A. Domain Decomposition and Dynamic Load-Balancing 

Parallelization is achieved via domain decomposition. Each cell is affiliated with a process, but this affiliation can 
change during the simulation if the load of the processes has to be rebalanced. 

In order to realize maximal performance, the computational load should be distributed equally among the processes. 
If inhomogenities exist from the very beginning or emerge during the simulation, dynamic load-balancing becomes 
important. A typical example is cluster formation in granular gases. Il8l| 

Domain decomposition should thus take into account the following points: Firstly, the load of the processes should 
be distributed homogeneously. Secondly, in order to minimize process communication, the border area of each process 
domain should be as small as possible, which implies that the ideal process domains are squares or cubes. Thirdly, a 
simple and fast function that assigns a cell to a process and vice versa is required. 

We meet these demands in the following way: Cell numbers are assigned to cell coordinates in a tree-like structure, 
see a small 2D-example in Fig.0] (Note that a realistic example has thousands or millions of cells.) These cell numbers 
are obtained by interleaving all the bits 1 to fc of the cell coordinates: a;[A;], . . . , a:;[l]. The result is 

a cell number in the range to 2'^^ — 1. (For convenience these numbers are increased by one in Fig. 0]) 

Then a block of 2" consecutive cell numbers is affiliated with each process, where n can be different for each process. 
Suppose the cells 1-16 in Fig.^jshould be distributed over 4 processes. If a lot of particles are aggregated in the lower 
right corner, load-balancing could result in the following layout: Process I is affiliated with cells 1-8, process II with 
cells 9-12, process III with cells 13-14, and process IV with cells 15-16. 

Every now and then the load of the processes has to be checked. A reasonable and simple measure of the load is the 
number of particles or the number of collisions in the process domain. If a restructuring of the domain decomposition 
would result in a significantly better load-balance, three processes, respectively, exchange information about their 
particle and cell data, so that two light-weight neighboring processes can merge their domains and a heavy-weight 
process can split its domain in two halves. In the example above, if the initially inhomogeneous system becomes 
homogeneous, this procedure could result in the merging of the domains of processes III and IV and a splitting of the 
domain of process I. Process II does not participate in this rebalancing, but it should be informed that its neighbor 
processes have changed, of course. In the end the layout would look like that: Process I is affiliated with cells 1-4, 
process II with cells 9-12, process III with cells 5-8, and process IV with cells 13-16. 
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FIG. 4: Cell numbering layout in 2D with an example of domain decomposition. Inhomogeneous load distribution leads to 
unequal domain sizes. 

B. Causal Order 

A parallel approach to simulate asynchronous events has to make use of the concept of local times. Each process 
handles the events in its domain and thereby increases the local simulation time. When particles cross domain 
boundaries, the affected processes communicate with each other and events are inserted into or removed from the 
event lists of the processes. If the event time of a newly inserted particle is less than the local simulation time, 
causality is violated, because a collision of that particle with another one, which has already been processed, could 
have been missed. 

In general, there are two strategies to circumvent this problem: In a conservative approach only those processes 
that are guaranteed not to violate causality are allowed to handle their events, the rest of the processes has to idle. In 
an optimistic approach the processes do not have to idle. If causality is violated, a complex rollback protocol which 
corrects the erroneous computations has to be executed. Whether a conservative approach is efficient or not depends 
highly on the maximal event propagation speed. 

Unfortunately, in event-driven molecular dynamics there is no upper limit for the speed of events propagating along 
a chain of particles |l2l |. even if the particles themselves are moving very slowly. In other words, in the conservative 
case one process is operating and all others are idling and we are back to the serial algorithm. So we are left with the 
optimistic strategy and have to undertake the task of implementing a rollback protocol. 

C. State Saving and Error Recovery 

When a causality error is detected, the simulation is restarted from the latest saved state. So the algorithm has to 
make a backup copy of the simulation data periodically and has to ensure that there is no latent causality error in this 
backup. The latter is guaranteed if all processes are synchronized at saving time. This means that only those processes 
are allowed to continue their computations whose local simulation times have not yet reached synchronization time. 
Note that there are other operations, like e. g. data output or load-balancing, which require periodic synchronization 
anyway. Besides, without synchronization the local simulation times tend to drift apart, which makes causality errors 
very likely [l^ . 

To find the optimal backup interval, we make use of an adaptive strategy. If no causality error turns up between two 
successive save operations, the interval increases, otherwise it decreases. If other operations trigger synchronization, 
this occasion is used for state saving, too, of course. 

If a causal error occurs, all processes perform a rollback to the saved state. Furthermore a synchronization barrier 
is scheduled for the time when the error occured. This prevents the same causality error from happening again. Of 
course, another error could occur at an earlier point of time. Then the simulation would perform another rollback 
and an earlier resynchronization would be scheduled. 

For comparison the algorithm described in 12:] needs two backup copies of the simulation data. So our strategy 
reduces memory consumption further by a factor 3/2. 
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parallel loop: 

1. communication about border zone events — > timestep 

2. timestep <— min(timestep,time(synchronous tasks)) 

3. for(timestep) event processing loop (see Fig. Ql, 
send border zone particle information 

4. receive and process particle informations 

5. error detection 

6. if(error) rollback 

7. if (global min(current time) — time(synclironous tasks)) 
state saving, load-balancing, data output, . . . 

8. goto 1 



FIG. 5: Outline of the parallel algorithm 

D. Border Zone and Process Communication 

The border zone (in 'l2] it is called insulation layer) consists of the cells Cbordcr whose neighbors belong to a 
different process. With the domain decomposition described in section Fill Al there is always a "monolayer" of border 
zone cells at the boundary of a process area. 

Each process thus maintains a list of virtual border zone cells which actually belong to its neighboring processes 
and a list of virtual particles residing in the virtual border zone cells. Those virtual particles can act as partners for 
real particles. But no events are calculated for them directly and they are not represented in the event list, since 
they are real particles on another process. The events are already calculated there and will be communicated to the 
adjacent processes. 

However, it is highly inefficient for a process to communicate after every event with its neighbors. So the parallel 
algorithm is designed in a stepwise manner (see Fig.O: Each computing step lasts until the next event in the border 
zone of a process, which is obtained in the following way: An event is associated with each particle. Each event 
belongs to a cell (or two adjacent cells, if two colliding particles reside in different cells or if a particle changes from 
one cell to another). The smallest event time in a cell is called the cell time iccU- These cell times are stored in an 
additional priority queue similar to the event list in section lllljl This list contains the scheduled cell times for all 
local border zone cells and returns the minimum time tstep = miuc (iccu(c)) — iccii(c*) which belongs to cell c*. 

Now, this time is used as the preliminary length of the calculation step. But firstly, the neighboring cells of c*, 
if located on other processes are checked. If they have scheduled an even earlier event, ^stcp will be shortened to 
that event. The communication is done in the following way: Each process sends c* and iceii(c*) as a query to the 
neighboring processes, which check the adjacent cells for their event times and reply with the minimum thereof. If this 
answer is smaller, then it is used as tstep instead (see Fig. El step 1). Periodic tasks like data output, load-balancing 
or state saving, which require synchronization of the processes can shorten tgtcp even more (see Fig. (Sj step 2). 

One could think of more rigid policies to determine the length of a step, see e.g. 0|. We have also tried other 
strategies that reduce the number of rollbacks. But on the other hand, they reduce parallelism, too. A large part of 
the processes are idling, the communication overhead increases, and the overall performance goes down. 

After the communication phase, parallel event processing starts and proceeds until the calculated point of time 
^stcp (see Fig. El step 3), i. e. on average O (C/Cborder) iterations are processed. If the algorithm encounters an earlier 
border zone event which has not been anticipated, the event processing step is stopped immediately to prevent the 
occurence of a causality error. But normally, the last processed event is a regular border zone event. Then, after the 
last particle state update, the state information has to be communicated to the neighboring processes. 

When a process has finished its computing step, it reads the particle state messages that it has received during this 
step and adapts its data structures accordingly (see Fig. |31 step 4). Real particles can only be affected as collision 
partners of virtual particles. For a virtual particle there are several possibilities: A virtual particle can become a 
real particle by changing from one process to another, it can remain a virtual particle, but with a different position, 
velocity or cell number, or finally it can emerge on or disappear from the virtual particle list, because it enters or 
leaves the border zone, respectively. If a virtual particle becomes real and newly calculated events for this particle 
violate causality, i. e. they are happening before the local simulation time, an error is signaled. Moreover, the processes 
check the border zone cell times and compare them with the replies to their neighboring processes. If the actual cell 
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time is smaller than the reply (which corresponds to the current time of the neighbor process), a causality error has 
already happened with high probability, because a collision has been missed on the neighbor process, and an error is 
signaled, too (see Fig. El step 5). 

If no error is detected, the simulation continues with the next step. Otherwise a global rollback is performed (see 
Fig.El step 6) and the erroneous simulation step is restarted from the latest saved state (see section lTlI Cp . To prevent 
the reappearance of the same error, a synchronization barrier is issued at the time when the error occured. 

Other synchronization barriers can stem from periodic tasks such as state saving, load-balancing or data output 
(see Fig. El step 7). 

However, most of all parallel loops do not include synchronization of the processes, but only communication between 
them. 



E. Parallel Performance 

First of all, parallelization imposes several performance penalties on the algorithm, among them communication 
overhead, idle time, state saving, and rollbacks after erroneous computation steps. On the other hand, the computa- 
tional costs are shared among several processes. If the latter outweighs the former, parallelization can be considered 
succesful. The same holds true for memory requirements. State saving makes a copy of the whole system state 
and therefore doubles the memory usage. But here as well parallelization combines the limited resources of single 
processes. 

Figs. 16171 show that the speed-up of the parallelization (i. e. the reciprocal of the simulation time) for fixed system 
size is approximately proportional to P^/^, where P is the number of processes. Furthermore, as shown in Fig. (HI 
the parallelization is also scalable if the number of processes is chosen proportional to the system size. There are 
deviations for small P when the effect of the parallelization overhead is large. In addition, it has been shown in 
that the error recovery method presented in section IlII CI is not scalable for P ^ cxo. But for practical purposes, at 
least until P — 128, the algorithm remains scalable. 

We have tested the parallel algorithm on a computer cluster with Pentium III 650 MHz processors and a 1.28 GBit/s 
switched LAN. On a supercomputer with optimized communication hardware, scalability should be even better. 

For real physical applications with more than 10^ collisions per particle, the maximal number of particles typically 
is limited by both available memory (see Fig. and computing time to about 10^ particles per process, i. e. a total 
number of particles of about 10*. 

IV. SUMMARY 

We have demonstrated how to parallelize event-driven molecular dynamics successfully. Our algorithm is based on 
some ideas from [T^ . i. e. we have used an optimistic parallelization approach which performs a rollback protocol if a 
causality error occurs. But the algorithm is enhanced in several ways: 

Firstly, we have implemented dynamic load-balancing, which makes simulation of inhomogeneous systems possible. 
Computing time is further reduced by an adaptive linked-cell structure which determines the optimal cell sizes. 

Secondly, we have transferred the shared memory approach of |l2j | to distributed memory architectures as well. 
The parallelization has been realized with MPI. In order to minimize idle time, we have made use of asynchronous 
communication, i. e. send and receive actions are decoupled from each other. In addition, the amount of communication 
is limited to a minimum: The event processing can continue steadily until a border zone event takes place on the 
local process or is expected to take place on a neighboring process. Only then the calculation has to be interrupted in 
order to communicate with the neigboring processes. So, even on a cluster with rather poor communication hardware, 
parallelization yields a speed-up proportional to P^/"^ — at least up to P = 128 parallel processes. 

Thirdly, we have optimized the data structure of the algorithm. With event-driven molecular dynamics insufficient 
memory is often a more serious problem than computing time. In total our optimizations reduce memory requirements 
to one third as compared to the method proposed in This enabled us to perform simulations of real physical 

problems with up to 10* particles. 
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FIG. 6: Speed-up for different numbers P of processes in 2D. N = 5 ■ 10^ particles, volume fraction v = 0.3, C = 1024^ w 10® 
cells. The dashed line has a slope of 0.49. The data point for 1 process deviates from the dashed line, because the serial 
algorithm has no communication overhead. Note the logarithmic axes. 
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FIG. 7: Speed-up for different numbers P of processes in 3D. iV = 2 • 10® particles, volume fraction v = 0.25, C = 128^ « 2 • 10® 
cells. The dashed line has a slope of 0.45. 
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FIG. 8: Speed-up for different numbers P of processes in 3D. N/P = 5-10* particles per process, volume fraction i/ — 0.2. The 
dashed line has a slope of 0.50. 
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FIG. 9: Memory requirements per process for different numbers P of processes in 3D. The points represent the measured values 
for AT = 2 • 10^. The curves are obtained by a simple estimate ciN/P + C2{N / P)'^^^ + csAf, where the first term refers to real 
particles, the second term to virtual particles, and the third term to global data. The data point for 1 process deviates from 
the estimated value, because the serial algorithm has no need for state saving. 



